Brooklyn real estate sales analysis by Liya Naumova

I’ve chosen data on Brooklyn real estate sales in 2015. It consists of more than 23 000 observations of 21 variables.

## 'data.frame':    23223 obs. of  21 variables:
##  $ borough                       : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ neighborhood                  : Factor w/ 60 levels "BATH BEACH","BAY RIDGE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ building.class.category       : Factor w/ 44 levels "01  ONE FAMILY DWELLINGS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tax.class.at.present          : Factor w/ 10 levels "1","1A","1B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ block                         : Factor w/ 5525 levels "20","27","28",..: 3910 3911 3919 3920 3922 3922 3937 3938 3938 3940 ...
##  $ lot                           : Factor w/ 1083 levels "1","2","3","4",..: 22 17 60 48 49 51 39 8 108 23 ...
##  $ ease-ment                     : chr  " " " " " " " " ...
##  $ building.class.at.present     : Factor w/ 128 levels "A1","A2","A3",..: 5 5 7 106 106 106 1 106 106 5 ...
##  $ address                       : chr  "8647 15TH AVENUE" "55 BAY 10TH   STREET" "8620 19TH   AVENUE" "1906 86TH   STREET" ...
##  $ apartment.number              : chr  NA NA NA NA ...
##  $ zip.code                      : Factor w/ 39 levels "11201","11203",..: 27 27 13 13 13 13 13 13 13 13 ...
##  $ residential.units             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ commercial.units              : num  0 0 0 1 1 1 0 1 1 0 ...
##  $ total.units                   : num  1 1 1 2 2 2 1 2 2 1 ...
##  $ land.square.feet              : num  1547 1933 2417 1900 1725 ...
##  $ gross.square.feet             : num  1428 1660 2106 2090 2112 ...
##  $ year.built                    : num  1930 1930 1930 1931 1925 ...
##  $ tax.class.at.time.of.sale     : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ building.class.at.time.of.sale: Factor w/ 132 levels "A0","A1","A2",..: 6 6 8 110 110 110 2 110 110 6 ...
##  $ sale.price                    : num  758000 778000 0 1365000 1470000 ...
##  $ sale.date                     : Date, format: "2015-03-31" "2015-06-15" ...
##   borough neighborhood  building.class.category tax.class.at.present block
## 1       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6360
## 2       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6361
## 3       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6371
## 4       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6372
## 5       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6374
## 6       3   BATH BEACH 01  ONE FAMILY DWELLINGS                    1  6374
##   lot ease-ment building.class.at.present              address
## 1  22                                  A5     8647 15TH AVENUE
## 2  17                                  A5 55 BAY 10TH   STREET
## 3  60                                  A9   8620 19TH   AVENUE
## 4  48                                  S1   1906 86TH   STREET
## 5  49                                  S1   1964 86TH   STREET
## 6  51                                  S1   1970 86TH   STREET
##   apartment.number zip.code residential.units commercial.units total.units
## 1             <NA>    11228                 1                0           1
## 2             <NA>    11228                 1                0           1
## 3             <NA>    11214                 1                0           1
## 4             <NA>    11214                 1                1           2
## 5             <NA>    11214                 1                1           2
## 6             <NA>    11214                 1                1           2
##   land.square.feet gross.square.feet year.built tax.class.at.time.of.sale
## 1             1547              1428       1930                         1
## 2             1933              1660       1930                         1
## 3             2417              2106       1930                         1
## 4             1900              2090       1931                         1
## 5             1725              2112       1925                         1
## 6             1725              2112       1931                         1
##   building.class.at.time.of.sale sale.price  sale.date
## 1                             A5     758000 2015-03-31
## 2                             A5     778000 2015-06-15
## 3                             A9          0 2015-09-16
## 4                             S1    1365000 2015-05-29
## 5                             S1    1470000 2015-05-06
## 6                             S1    1790000 2015-04-30

Univariate Plots Section

Price

First, I looked at the price distribution.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0         0    350000    802800    825000 169000000

The first quartile is zero, so at least a quarter of the rows have a zero price. Next I want to look at the lowest values.

## 
##    0    1    5    7    9   10 
## 8157   56    1    1    1  227

Then I exclude the zeros and repeat the summary.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         1    378000    660000   1237000   1100000 169000000
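The three steps above (summary, table of the lowest values, summary without zeros) can be sketched in base R. This is only an illustration: the data frame name `sales` and the handful of prices in it are assumptions, not the real data.

```r
# Toy stand-in for the real data; the prices are made up for illustration.
sales <- data.frame(sale.price = c(0, 0, 1, 10, 758000, 778000, 1365000))

# Five-number summary plus mean, as in the first output above.
summary(sales$sale.price)

# Counts of the six lowest distinct prices.
lowest <- head(table(sales$sale.price), 6)
print(lowest)

# Drop zero-price rows and summarise again.
nonzero <- subset(sales, sale.price > 0)
summary(nonzero$sale.price)
```

Keeping the filtered rows in a separate object (`nonzero`, a hypothetical name) leaves the original data frame intact for variables where a zero price does not matter.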

Now I’m interested in those $1 properties

##              building.class.category                  address year.built
## 1           01  ONE FAMILY DWELLINGS          415 99TH STREET       1899
## 2    07  RENTALS - WALKUP APARTMENTS     263 BAY RIDGE AVENUE       1931
## 3           02  TWO FAMILY DWELLINGS     523 LEXINGTON AVENUE       1993
## 4         03  THREE FAMILY DWELLINGS         109A HART STREET       2005
## 5    07  RENTALS - WALKUP APARTMENTS   1077-79 BEDFORD AVENUE       1931
## 6    07  RENTALS - WALKUP APARTMENTS        165 QUINCY STREET       1931
## 7  08  RENTALS - ELEVATOR APARTMENTS         273 GATES AVENUE       1920
## 8            14  RENTALS - 4-10 UNIT 260 MARCUS GARVEY BOULEV       1931
## 9           01  ONE FAMILY DWELLINGS         1829 51ST STREET       1920
## 10    12  CONDOS - WALKUP APARTMENTS      3822A 15TH   AVENUE         NA
##    gross.square.feet sale.price
## 1                652          1
## 2               5880          1
## 3               1802          1
## 4               3093          1
## 5              11520          1
## 6               5580          1
## 7              44460          1
## 8               4056          1
## 9               1344          1
## 10                 0          1

The other values look normal, so I assume $1 is a nominal price rather than a real one.

Let’s look at a histogram of prices divided by 1000. The distribution has a very long tail, so let’s look at all prices lower than 5 mln.

Most prices are distributed between 1000 and 1 000 000, with peaks around 500 000, 950 000 and 1 250 000. There are also many more prices slightly below 1 mln than slightly above it. After log transformation the prices look nearly normal, apart from some exceptions lower than 1000.
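A rough sketch of the two histograms described above, drawn with base graphics rather than ggplot2 so that it runs without extra packages. The prices are synthetic log-normal draws; their parameters are assumptions chosen only to resemble a long-tailed price variable.

```r
set.seed(1)
# Synthetic, roughly log-normal prices standing in for sale.price.
price <- round(rlnorm(1000, meanlog = log(660000), sdlog = 0.8))
price <- price[price > 0]

# Raw prices in thousands of dollars, restricted to sales under 5 mln.
hist(price[price < 5e6] / 1000, breaks = 50,
     main = "Sale price", xlab = "Price (thousand USD)")

# The same prices on a log10 scale: the long right tail is pulled in.
hist(log10(price), breaks = 50,
     main = "Sale price (log10)", xlab = "log10(price)")
```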

Gross square feet

The next feature of interest for me is gross square feet.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1728    2889    2870  366000

It looks like there are also lots of zeros here. Summary without zeros:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      65    1800    2480    4501    3435  366000

Let’s look at the distribution of values greater than 0.

Here we also have a very long tail. I focus on values between 1 and 10000. Most values are between 1000 and 3500 square feet, and the peak is around 2000. Now I want to look at it after log transformation.

Now it looks closer to normal, but still with long tails.

Land square feet

Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1900    2223    2500  293000

It looks like we also have lots of zeros and outliers here.

Summary without zeros:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      20    1882    2083    3284    2750  293000

Histogram:

Most values are distributed between 1500 and 3000 with spikes at round numbers (2000, 2500, 3000 …). The most common value is 2000.

After log transformation:

I decided to make a new variable “total.square.feet” as the sum of “land.square.feet” and “gross.square.feet”.
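As a minimal sketch, the new variable can be computed as below. The column names come from the `str()` output; the data frame name `sales` and the four rows are illustrative assumptions.

```r
# Toy rows; values are illustrative only.
sales <- data.frame(
  land.square.feet  = c(1547, 1933, 0, 1900),
  gross.square.feet = c(1428, 1660, 0, 2090)
)

# New variable: land area plus gross floor area.
sales$total.square.feet <- sales$land.square.feet + sales$gross.square.feet

# Rows where both inputs are zero stay at zero and need filtering later.
sum(sales$total.square.feet == 0)
```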

Total square feet

Summary of the new variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    3818    5112    5265  589300

I still have 7500 zero values. Summary of non-zero values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      20    3780    4700    7551    6116  589300

As expected, this is a similar distribution with a very long tail.

Price per square foot

Then I want to make another variable: price per square foot.
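The report does not say which area is used as the denominator. The sketch below assumes it is `total.square.feet` and guards against division by zero; the variable name `price.square.foot` matches the one used later in the report, while the data values are made up.

```r
sales <- data.frame(
  sale.price        = c(758000, 778000, 0, 1365000),
  total.square.feet = c(2975, 3593, 4523, 0)
)

# Assumption: the denominator is total.square.feet. Rows with zero area
# get NA instead of Inf or NaN.
sales$price.square.foot <- ifelse(sales$total.square.feet > 0,
                                  sales$sale.price / sales$total.square.feet,
                                  NA_real_)

# Summarise only the non-missing, non-zero values.
ok <- !is.na(sales$price.square.foot) & sales$price.square.foot > 0
summary(sales$price.square.foot[ok])
```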

Summary of non-zero values:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.01   110.40   166.80   219.60   244.40 12060.00

Distribution:

Year built

The next feature of interest is year.built.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1800    1915    1930    1940    1960    2016    1993

Its distribution:

Most buildings were built between 1895 and 1935, with other peaks in 1950-1956 and 2005-2015. With binwidth = 1 it is possible to notice spikes on round years (1900, 1920…). I assume some of these values are approximate.
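A one-year bin width can be reproduced in base R by supplying explicit breaks. In this sketch the years are illustrative, and treating anything before 1800 as missing is an assumption about how invalid years were cleaned.

```r
# Illustrative years only; 0 stands for an invalid entry.
year.built <- c(1899, 1900, 1900, 1920, 1920, 1920, 1931, 1955, 2005, 0, NA)

# Assumption: years before 1800 are treated as missing.
year.built[!is.na(year.built) & year.built < 1800] <- NA

# One bin per calendar year, so spikes on round years stay visible.
breaks <- seq(min(year.built, na.rm = TRUE) - 0.5,
              max(year.built, na.rm = TRUE) + 0.5, by = 1)
h <- hist(year.built, breaks = breaks, main = "Year built", xlab = "Year")
```

Centering each bin on a whole year (breaks at .5) avoids values falling exactly on a bin edge.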

Residential units

Counts of each value:

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 5608 6767 6104 2292  827  162  549   90  234   54   31    7   49   11   12 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   12   47   14   14   12   37   13    9   10   18    5    6    6    8    6 
##   30   31   32   33   34   35   36   38   39   40   41   42   43   44   45 
##    3   13   13    4    4    8    5   10   10    5    5    5    2    3    2 
##   47   48   49   50   51   52   53   54   55   56   58   59   60   61   62 
##    5    8    2    4    1    4    2    4    1    2    5    1    2    1    3 
##   63   64   65   66   67   68   69   70   72   74   75   77   78   79   81 
##    2    1    1    3    1    3    1    2    2    3    1    3    2    1    1 
##   82   83   84   89   90   92   93   95   96  102  103  104  107  108  112 
##    2    1    5    2    1    1    1    1    1    3    1    1    1    1    1 
##  114  118  119  120  121  126  131  133  169  172  178  190  200  225  234 
##    1    1    2    1    1    1    1    2    1    1    1    2    1    1    1 
##  268  270  334  338 
##    1    1    1    1

Histogram:

Most properties have from 0 to 5 residential units.

Commercial units

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 20832  1827   321   100    61    34    16     7     4     1     3     1 
##    12    13    15    16    24    28    29    30    54   201   355 
##     2     1     1     2     4     1     1     1     1     1     1

Histogram:

Most properties have 0 commercial units.

Total units

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3665 8211 5882 2743  845  262  590  105  266   59   53   17   46   19    6 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   11   56   14   15   11   36   18    6   14   20    8    7    1   13   10 
##   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44 
##    5   11   11    2    3    8    4    3    8    9    4    8    7    1    5 
##   45   47   48   49   50   51   52   53   54   56   58   59   60   61   62 
##    3    5    9    2    4    1    5    1    4    2    5    3    2    1    3 
##   63   64   65   66   67   68   69   70   72   75   77   78   79   81   82 
##    2    1    1    3    1    3    1    3    2    3    3    2    2    1    1 
##   83   84   85   86   89   90   94   95   96   97  102  103  104  107  108 
##    1    4    1    1    1    2    1    1    1    1    3    1    1    1    1 
##  112  114  118  120  121  123  126  131  135  169  172  178  192  200  201 
##    1    1    1    1    1    2    1    1    2    1    1    1    2    1    1 
##  225  237  270  335  339  355 
##    1    1    2    1    1    1

Histogram:

Most properties have between 0 and 5 total units.

Categorical variables

Borough

## 
##     3 
## 23223

All observations of this variable have the same value.

Neighborhood

Categorical variable with 60 levels

I’ve noticed that neighborhoods are represented unevenly in the dataset. Some of them have only a small number of observations.

Building class category

Categorical with 44 levels. Ten most common values:

## 
##           02  TWO FAMILY DWELLINGS           01  ONE FAMILY DWELLINGS 
##                               6015                               3107 
##    10  COOPS - ELEVATOR APARTMENTS         03  THREE FAMILY DWELLINGS 
##                               2232                               2173 
##   13  CONDOS - ELEVATOR APARTMENTS    07  RENTALS - WALKUP APARTMENTS 
##                               1998                               1925 
## 15  CONDOS - 2-10 UNIT RESIDENTIAL                  44  CONDO PARKING 
##                                722                                695 
##      09  COOPS - WALKUP APARTMENTS             04  TAX CLASS 1 CONDOS 
##                                569                                526

Most properties are one- to three-family dwellings or condos.

Tax class at present

Categorical with 10 levels

## 
##     1    1A    1B    1C     2    2A    2B    2C     3     4 
## 11295   393   314   134  5401  1609   465  1046     2  2503

Block

I transformed this to a factor, which has 5525 levels. The most frequent values are:

## 
## 8720 1890 2135 1896  152  286 7279 1217 2324 2348 
##  135  109   96   95   91   90   69   54   53   53

Lot

I transformed it to a factor too and got 1083 different values. The most frequent of them are:

## 
##   1  11   6  12  18  21  17  14  35  13 
## 795 322 314 288 288 283 282 280 277 275
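The factor conversion and the "most frequent values" tables for block and lot follow the same pattern, sketched here in base R with a few illustrative block numbers (the values are not taken from the real data).

```r
# Illustrative block numbers, stored as numbers as in a raw file.
block <- c(8720, 8720, 8720, 1890, 1890, 152, 6360)

# Convert to a factor: block numbers are identifiers, not quantities.
block <- factor(block)
nlevels(block)

# Frequency table sorted so the most common values come first.
top <- head(sort(table(block), decreasing = TRUE), 3)
print(top)
```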

Easement

## 
##       
## 23223

All observations have empty values.

Building class at present

Factor with 128 levels. Most common values:

## 
##   D4   B1   C0   R4   B3   B2   A5   B9   A1   A9 
## 2230 2222 2164 1994 1251 1146  976  900  776  744

All values:

## 
##   A1   A2   A3   A4   A5   A7   A9   B1   B2   B3   B9   C0   C1   C2   C3 
##  776  188   53  173  976    2  744 2222 1146 1251  900 2164  525  577  663 
##   C4   C5   C6   C7   C8   C9   D0   D1   D3   D4   D5   D6   D7   D8   D9 
##   34   31  563  127    6    7    2  106   13 2230    7   18   34    2    7 
##   E1   E2   E7   E9   F1   F2   F4   F5   F9   G0   G1   G2   G4   G5   G6 
##   52    5    5  110   11    3   10   13   77   31   67   24    7    7   10 
##   G7   G8   G9   GU   GW   I1   I4   I5   I6   I7   I9   J9   K1   K2   K4 
##  173    4   40    4    3    3    6    5    4    7    2    1  135   83  149 
##   K5   K6   K7   K9   L1   L8   L9   M1   M2   M3   M4   M9   N2   N9   O1 
##    8    1    3    1    1    1    3   22    1    2    1   12    2    4   11 
##   O2   O5   O7   O8   O9   P3   P5   P6   P9   Q9   R0   R1   R2   R3   R4 
##   21   28   23   11    4    2    1    1    3    1    1  731  341  393 1994 
##   R5   R6   R7   R8   R9   RA   RB   RG   RK   RP   RR   RS   RT   RW   S0 
##   10  133    1   47   37    3   63  397   24  300    7  124   23   41    6 
##   S1   S2   S3   S4   S5   S9   T9   U7   U8   V0   V1   V2   V3   V5   V9 
##  179  479  104  101   74  132    1    1    1  303  199    5    6    3    9 
##   W1   W2   W3   W8   W9   Y1   Z0   Z9 
##    1    5    2    2   10    1    5   97

Address and apartment number

Character values that should represent unique buildings or apartments. I want to see if any of them repeat:

## 
## 163 WASHINGTON AVENUE    185 PACIFIC STREET     388 BRIDGE STREET 
##                   106                    85                    63 
##    143 CLASSON AVENUE       184 KENT AVENUE 
##                    59                    53

ZIP code

Factor with 39 levels that can represent the geographical location of a building.

Tax class at time of sale

Factor variable with 4 levels

## 
##     1     2     3     4 
## 12176  8448     2  2597

Most values are of class 1

Building class at time of sale

Factor variable with 132 levels. Most frequent:

## 
##   D4   B1   C0   R4   B3   B2   A5   B9   A1   A9   R1   C3   C2   C6   C1 
## 2230 2209 2173 1998 1256 1159  976  906  778  752  722  647  568  563  509 
##   S2   RG   R3   R2   V0   RP   V1   G7   A2   S1   A4   K1   K4   R6   S9 
##  485  407  390  343  306  288  219  216  191  180  167  140  136  135  133 
##   C7   RS   E9   S3   S4   Z9   D1   K2   F9   S5   G9   RB   A3   R8   G2 
##  130  122  114  104  102   97   95   84   79   74   70   66   53   51   48 
##   RW   E1   R9   C4   G1   G0   O7   C5   D7   O9   M1   RK   RT   G6   R5 
##   41   39   37   34   32   31   31   30   30   26   23   21   21   16   15 
##   E3   F4   F5   F1   M9   O1   O5   W9   O2   O8   RR   D6   K5   V3   C9 
##   14   14   14   13   12   12   12   10    9    9    9    8    8    8    7 
##   D9   G4   G5   I7   C8   D5   I4   K9   S0   V9   E7   I5   V2   Z0   F2 
##    7    7    7    7    6    6    6    6    6    6    5    5    5    5    4 
##   I6   I9   N9   P9   W2   A7   E2   I1   K7   L9   V5   W8   D0   GU   M3 
##    4    4    4    4    4    3    3    3    3    3    3    3    2    2    2 
##   N2   P3   RA   W3   A0   D3   D8   E4   G8   GW   J9   K6   L1   L8   M2 
##    2    2    2    2    1    1    1    1    1    1    1    1    1    1    1 
##   M4   O6   P5   P6   Q9   R0   R7   T9   U7   U8   W1   Y1 
##    1    1    1    1    1    1    1    1    1    1    1    1

Sale date

I should have sales for the whole of 2015.

Most sales are made on weekdays, with a spike on June 30 and a decline toward the end of the year.
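One way to check the weekday pattern and find the busiest day, sketched in base R with a few illustrative dates. `as.POSIXlt()$wday` is used instead of `weekdays()` so that the result does not depend on the locale.

```r
# Illustrative sale dates (2015-03-28 was a Saturday).
sale.date <- as.Date(c("2015-03-31", "2015-06-15", "2015-06-30",
                       "2015-06-30", "2015-03-28"))

# Day of week as a number: 0 = Sunday ... 6 = Saturday.
wday <- as.POSIXlt(sale.date)$wday
is.weekend <- wday %in% c(0, 6)
table(is.weekend)

# Sales per calendar day; a spike shows up as the largest count.
per.day <- table(sale.date)
per.day[which.max(per.day)]
```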

Univariate Analysis

What is the structure of your dataset?

There are 23223 property sales in this dataset, described by 21 original variables plus the two I derived. 8157 rows have a sale price of zero. For the others the mean price is $1 237 000 and the median is $660 000. Most properties have between 1000 and 3500 square feet of floor area and between 1500 and 3000 square feet of land. Most buildings were constructed between 1895 and 1935.

What is/are the main feature(s) of interest in your dataset?

The main features for me are price, gross square feet and land square feet.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that I find interesting are year built, number of units, sale date and ZIP code.

Did you create any new variables from existing variables in the dataset?

I created a variable “total square feet” for the sum of gross and land square feet and a variable for price per square foot.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I’ve deleted zeros and obvious mistakes from “year.built”. For plots I divided price by 1000 and applied logarithmic transformations to the price and area variables because of their long-tailed distributions.

Bivariate Plots Section

Price and gross square feet

First I want to plot price vs. gross square feet.

Filter out zero values and zoom in:

Here we can see some vertical bands at round numbers, a horizontal stripe around $1 000 000, and a lot of variance in price for the same square footage. The smoothing layer shows that on average larger properties cost more. I want to calculate the correlation coefficient:

##       cor 
## 0.5014883

There is a positive correlation, but it is not very strong.
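The `cor` label in the output above is what `cor.test()` returns in its `estimate` element. A minimal sketch on synthetic data follows; the relationship and noise level are assumptions, so the resulting value will not match the report's.

```r
set.seed(42)
# Synthetic data: price depends on area plus substantial noise.
area  <- runif(500, 500, 5000)
price <- 300 * area + rnorm(500, sd = 400000)
sales <- data.frame(gross.square.feet = area, sale.price = price)

# Keep rows where both values are positive, then take Pearson's r.
d <- subset(sales, gross.square.feet > 0 & sale.price > 0)
r <- cor.test(d$gross.square.feet, d$sale.price)$estimate
r
```

Pearson's coefficient is sensitive to extreme values, so with tails this long it may be worth comparing it with `method = "spearman"` or with the correlation of the log-transformed variables.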

Price vs. land square feet

I delete zero values and look closer.

Here we can also see strong vertical bands and a lot of difference in prices for the same square footage.

##       cor 
## 0.3504385

There is a low positive correlation.

Price vs. total square feet

Now I filter out zero values and add a linear regression line:

As expected, larger properties cost more on average, but there is also a lot of variance due to other variables.

Correlation coefficient for total square feet and price:
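A scatter plot with a fitted regression line can be sketched in base R with `lm()` and `abline()`, shown here instead of the ggplot2 smoothing layer. The data are synthetic and the slope of 150 dollars per square foot is an arbitrary assumption.

```r
set.seed(7)
# Synthetic properties: price grows with area, with a lot of noise.
area  <- runif(300, 1000, 10000)
price <- 150 * area + rnorm(300, sd = 300000)
d <- data.frame(total.square.feet = area, sale.price = price)
d <- subset(d, total.square.feet > 0 & sale.price > 0)

# Ordinary least squares fit of price on area.
fit <- lm(sale.price ~ total.square.feet, data = d)

plot(d$total.square.feet, d$sale.price, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "Total square feet", ylab = "Sale price")
abline(fit, col = "red", lwd = 2)

# Estimated price increase per additional square foot.
coef(fit)["total.square.feet"]
```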

##       cor 
## 0.4821405

The correlation is lower than 0.5, suggesting that the variables are correlated, but not very strongly.

Price per square foot vs. total square feet

Now I’m interested in comparing price per square foot vs. total square feet; maybe there is a difference in price between small and large properties.

The plot shows a lot of variance in price but no obvious increase or decrease.

Price per foot vs year.built

I can see more variance in prices for houses built in 1899-1931 and 2000-2015, but I’m not sure about changes in mean prices, so I want to cut the years into decades and make a boxplot.

There are small differences in median price between decades. On average, properties built at the beginning of this century and of the last one cost more than those built in the middle of the last century.
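Cutting years into decades can be done with `cut()`. The sketch below uses the column name `year.built.bucket` that appears later in the report; the break points and the data values are assumptions.

```r
sales <- data.frame(
  year.built        = c(1899, 1905, 1931, 1955, 1958, 2005, 2012),
  price.square.foot = c(310, 280, 260, 150, 170, 400, 380)
)

# Decade buckets [1890,1900), [1900,1910), ...; dig.lab = 4 keeps the
# labels as plain four-digit years.
sales$year.built.bucket <- cut(sales$year.built,
                               breaks = seq(1890, 2020, by = 10),
                               right = FALSE, dig.lab = 4)

# One box per decade; decades with no sales are dropped from the plot.
boxplot(price.square.foot ~ droplevels(year.built.bucket), data = sales,
        las = 2, xlab = "", ylab = "Price per square foot")
```

With `right = FALSE` a building from 1900 falls in the 1900s bucket rather than the 1890s one, which matches how decades are usually read.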

Price per square foot vs. neighborhood

A lot of the difference in price per square foot is explained by location. I want to make a boxplot after filtering out zero prices and zoom in on prices lower than $2500:

Sale price per unit vs. neighborhood

We can also see a lot of difference in sale prices per unit.

Gross square feet vs. neighborhood

Next I want to look at distribution of floor area across neighborhoods:

For example, properties sold in the Downtown - Fulton Mall area are on average larger than those in Windsor Terrace. But if I look at area per unit, the differences are not so large:

I want to look closer to see the difference:

Sale price vs. tax class

Now I want to look at the distribution of prices for different tax classes:

Price per square foot vs. tax class

Next - prices per square foot:

The distribution for class 2C looks rather different from the others.

The median price for class 2C is considerably higher than for the others.

Tax class vs. Neighborhood

Most small residential properties (class 1) were sold in Bedford-Stuyvesant, most condos (class 2) in Park Slope, and most commercial properties (class 4) in Bedford-Stuyvesant and Williamsburg-North. Next I want to look at the proportions of different tax classes.
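Proportions of tax classes within each neighborhood can be obtained from a two-way table with `prop.table()`. A small sketch with made-up rows (the neighborhood labels and counts are illustrative only):

```r
sales <- data.frame(
  neighborhood = c("PARK SLOPE", "PARK SLOPE", "PARK SLOPE",
                   "BEDFORD STUYVESANT", "BEDFORD STUYVESANT"),
  tax.class.at.time.of.sale = c("2", "2", "1", "1", "4")
)

# Cross-tabulate neighborhood (rows) against tax class (columns).
counts <- table(sales$neighborhood, sales$tax.class.at.time.of.sale)

# Row-wise shares: each neighborhood's row sums to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Shares make neighborhoods with very different numbers of sales comparable, which raw counts do not.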

Building category vs. price per foot

There is some difference in median prices between building classes. For example, indoor public and cultural facilities cost more per square foot than educational facilities.

Sale price vs. sale date

Among big sales there are stripes around specific dates like 2015.03.01 or 2015.06.30.

The red line shows the mean price for every day of the year. Now I want to look at how price per square foot changes with time.

There is a slight increase toward the end of the year.
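The daily mean line described above could be computed with `aggregate()` and drawn over the scatter plot. This base R sketch uses four made-up sales on two dates.

```r
sales <- data.frame(
  sale.date  = as.Date(c("2015-03-01", "2015-03-01",
                         "2015-06-30", "2015-06-30")),
  sale.price = c(500000, 700000, 900000, 1500000)
)

# Mean sale price for each calendar day.
daily <- aggregate(sale.price ~ sale.date, data = sales, FUN = mean)

# Individual sales as points, daily mean as a red line.
plot(sales$sale.date, sales$sale.price, pch = 16,
     xlab = "Sale date", ylab = "Sale price")
lines(daily$sale.date, daily$sale.price, col = "red", lwd = 2)
```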

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a positive correlation between price and both gross and land square feet: larger properties cost more, but the correlation is not strong enough to explain all the variance in price. Houses built at the beginning and at the end of the last century have on average higher prices per square foot than those built in the middle of the century. Price per square foot varies significantly across neighborhoods. For tax class 2C the price distribution looks different from the others, with a higher mean and variance. There is some difference in median prices between building classes. For example, indoor public and cultural facilities cost more per square foot than educational facilities.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is some difference in median prices between building classes. For example, indoor public and cultural facilities cost more per square foot than educational facilities. The highest number of small residential properties (class 1) were sold in Bedford-Stuyvesant, of condos (class 2) in Park Slope, and of commercial properties (class 4) in Bedford-Stuyvesant and Williamsburg-North.

What was the strongest relationship you found?

Sale price correlates positively with floor and land area, and price is also strongly related to neighborhood.

Multivariate Plots Section

Price, floor area & tax class

Tax class 1 points are mostly situated in the lower left corner, class 2 in the lower middle, and class 3 are dispersed around. Now I’m interested in the floor area distribution of different tax classes:

Properties larger than 4000 square feet are mostly commercial or condo, and most small residential properties have an area of less than 5000 square feet.

Next I divided price and square footage by the number of units and used a log scale to see the relation between price and size of one unit across tax classes.

First I noticed a group of points with prices lower than $1000. This looks strange to me; maybe these are mistakes. It looks like units of class 2 are on average cheaper and smaller than class 1, and units of class 4 are larger and more expensive. I wonder whether price rises at the same rate as floor area, or whether one square foot costs less in large properties. I plotted price per square foot against floor area per unit on a log scale:

There is some evidence of a downward trend for tax class 2 and for bigger commercial properties.
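The per-unit transformation can be sketched as follows. The names `price.per.unit` and `area.per.unit` are hypothetical, and using `gross.square.feet` as the square footage is an assumption; the rows are made up.

```r
sales <- data.frame(
  sale.price        = c(758000, 1365000, 5000000, 900000),
  gross.square.feet = c(1428, 2090, 12000, 0),
  total.units       = c(1, 2, 10, 0)
)

# Keep rows where both the division and the logarithm are defined.
d <- subset(sales, total.units > 0 & gross.square.feet > 0 & sale.price > 0)
d$price.per.unit <- d$sale.price / d$total.units
d$area.per.unit  <- d$gross.square.feet / d$total.units

# log = "xy" puts both axes on a log10 scale.
plot(d$area.per.unit, d$price.per.unit, log = "xy",
     xlab = "Floor area per unit (sq ft)", ylab = "Price per unit (USD)")
```

Filtering before dividing matters here: rows with zero units would otherwise produce `Inf` or `NaN` and be silently dropped or misplaced on a log axis.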

Price, floor area and year built

Now I’m interested in whether price and square footage depend on the year built.

It looks like newer buildings are slightly cheaper.

Price per unit, tax class, year built

Here I can notice that among the properties built at the beginning of the last century, commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in the 21st century mostly cost less than residential and condos.

Price per square foot, tax class, year built

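library(dplyr)    # filter()
library(ggplot2)
# Price per square foot by decade built, one box per tax class.
# coord_cartesian() zooms the y axis without dropping rows.
ggplot(filter(sales, price.square.foot > 0),
       aes(year.built.bucket, price.square.foot)) +
  geom_boxplot(aes(fill = tax.class.at.time.of.sale)) +
  coord_cartesian(ylim = c(0, 2000)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))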

If I look at price per square foot, I see no obvious pattern.

Price, tax class and neighborhood

Now I’m interested in the distribution of prices per unit across neighborhoods, colored by tax class.

Here we can see that the distribution varies significantly between neighborhoods. For example, I notice clusters of lower-priced class 4 points in Bedford-Stuyvesant and Williamsburg and more expensive class 2 properties in the same Williamsburg.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I noticed a group of points with prices lower than $1000. This looks strange to me; maybe these are mistakes. It looks like units of class 2 are on average cheaper and smaller than class 1, and units of class 4 are larger and more expensive. Properties larger than 4000 square feet are mostly commercial or condo, and most small residential properties have an area of less than 5000 square feet. The price distribution varies significantly between neighborhoods. For example, I notice clusters of lower-priced class 4 points in Bedford-Stuyvesant and Williamsburg and more expensive class 2 properties in the same Williamsburg.

Were there any interesting or surprising interactions between features?

Among the properties built at the beginning of the last century, commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in the 21st century mostly cost less than residential and condos.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

On the log scale the distribution looks almost normal, with the exception of outliers lower than $1000. Most data are spread between $100 000 and $10 000 000, with the mode around $1 000 000. It is interesting that there are a lot more sales just below $1 mln than at $1 mln. I suppose this is a kind of psychological threshold, or it is connected with tax regulations.

Plot Two

Description Two

On this plot I can notice that among the properties built at the beginning of the last century, commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in the 21st century mostly cost less than residential and condos.

Plot Three

Description Three

This plot shows the relationship between unit sale price and floor area in different tax classes. Here we can see that properties of class 2 (condominiums and coops) mostly have an area of less than 1000 square feet and cost less than $1 mln. Commercial properties (class 4) are on average larger and cost more than $1 mln. The smoothing layer shows that in general price goes up as size increases, but there is a lot of variance due to other variables.


Reflection

This dataset contains information about c. 23000 real estate sales in Brooklyn, NY in 2015, described by 21 variables. Analysing individual variables, I found that the main features of interest (sale price, gross square feet, land square feet) have a significant proportion of missing values. Some categorical variables had more than 20 levels, which made visualisation more difficult. As expected, I found that the price and size variables have distributions with a very long right tail, which made me use log transformation.

After that I explored relationships between price and floor area, neighborhood, year built and tax class. Price is positively correlated with floor area, but a lot of the variance depends on location. I’m interested in further exploring the reasons behind the relationships between price and year built and between price and tax class.

It would also be interesting to build a model for price prediction and to find methods for imputing the missing data. Since real estate prices change over time, this will in my opinion be the main limitation on using a model built on this data.